In-class Exercise 4b

In this exercise, I learnt how perform multivariate analysis by plotting a heatmap plot in R

Sia Heng Guang https://www.linkedin.com/in/hengguang/ (MITB Analytics Track)https://scis.smu.edu.sg/
2022-02-12

Installing relevant libraries

Show code
packages = c('seriation', 'dendextend', 'heatmaply', 'tidyverse')

for (p in packages){
  if(!require(p, character.only = T)){
    install.packages(p)
  }
  library(p, character.only = T)
}

Importing data

Show code
wh <- read_csv("data/WHData-2018.csv")

Data Wrangling

Show code
row.names(wh) <- wh$Country
Show code
wh1 <- dplyr::select(wh, c(3, 7:12))
wh_matrix <- data.matrix(wh)

Building a static heatmap using heatmap() of base R

We can use the heatmap() function on the matrix to plot a heatmap.

The Rowv = NA and Colv = NA are used to switch off the dendograms which are on by default.

Show code
wh_heatmap <- heatmap(wh_matrix,
                      Rowv=NA, Colv=NA)

Show code
wh_heatmap <- heatmap(wh_matrix)

We can normalise it column-wise.

Show code
wh_heatmap <- heatmap(wh_matrix,
                      scale="column",
                      cexRow = 0.6, 
                      cexCol = 0.8,
                      margins = c(10, 4))

Building an interactive heatmap using heatmaply()

Show code
heatmaply(wh_matrix[, -c(1, 2, 4, 5)],
          scale = "column")

We can normalise it as well.

Show code
heatmaply(normalize(wh_matrix[, -c(1, 2, 4, 5)]))

Similar to normalising, we can use the percentise method to divide the values by the maximum rank. Each value is the percent of observations that is above or below the value.

Show code
heatmaply(percentize(wh_matrix[, -c(1, 2, 4, 5)]))

Clustering methods

We can use hierarchical clustering using the Euclidean distance and ward.D.

Show code
heatmaply(normalize(wh_matrix[, -c(1, 2, 4, 5)]),
          dist_method = "euclidean",
          hclust_method = "ward.D")

However, to determine the best clustering method, the dend_expend() and find_k() functions of the dendextend package will be used

Show code
wh_d <- dist(normalize(wh_matrix[, -c(1, 2, 4, 5)]), method = "euclidean")
dend_expend(wh_d)[[3]]
  dist_methods hclust_methods     optim
1      unknown         ward.D 0.6137851
2      unknown        ward.D2 0.6289186
3      unknown         single 0.4774362
4      unknown       complete 0.6434009
5      unknown        average 0.6701688
6      unknown       mcquitty 0.5020102
7      unknown         median 0.5901833
8      unknown       centroid 0.6338734

From this, we can see that the average method should be used as it has the highest optimum value.

Then, we can use find_k() to find the optimal number of clusters.

Show code
wh_clust <- hclust(wh_d, method = "average")
num_k <- find_k(wh_clust)
plot(num_k)

With that, we can now prepare the heatmap like below.

Show code
heatmaply(normalize(wh_matrix[, -c(1, 2, 4, 5)]),
          dist_method = "euclidean",
          hclust_method = "average",
          k_row = 3)

Seriation

The clustering does not place the rows in a definite order. Hence, we need to use seriation package to optimise the distance between adjacent leaves (labels).

We can use the Optimal Leaf Ordering (OLO) by default to do this.

Show code
heatmaply(normalize(wh_matrix[, -c(1, 2, 4, 5)]),
          seriate = "OLO")

We can also use the Gruvaeus and Wainer (GW) for a faster heuristic.

Show code
heatmaply(normalize(wh_matrix[, -c(1, 2, 4, 5)]),
          seriate = "GW")

The option “mean” gives the output we would get by default from heatmap functions in other packages such as gplots::heatmap.2.

Show code
heatmaply(normalize(wh_matrix[, -c(1, 2, 4, 5)]),
          seriate = "mean")

The option “none” gives us the dendrograms without any rotation that is based on the data matrix.

Show code
heatmaply(normalize(wh_matrix[, -c(1, 2, 4, 5)]),
          seriate = "none")

Adding finishing touches

We can apply a colour palette to improve the aesthetics of the heatmap.

Show code
heatmaply(normalize(wh_matrix[, -c(1, 2, 4, 5)]),
          seriate = "none",
          colors = Blues)

We can pass additional arguments to the heatmap to further refine the figure.

In the code chunk below the following arguments are used:

Show code
heatmaply(normalize(wh_matrix[, -c(1, 2, 4, 5)]),
          Colv=NA,
          seriate = "none",
          colors = Blues,
          k_row = 5,
          margins = c(NA,200,60,NA),
          fontsize_row = 4,
          fontsize_col = 5,
          main="World Happiness Score and Variables by Country, 2018 \nDataTransformation using Normalise Method",
          xlab = "World Happiness Indicators",
          ylab = "World Countries"
          )